Incremental Entity Resolution from Linked Documents

Abstract

In many government applications we often find that information about entities, such as persons, is available in disparate data sources such as passports, driving licences, bank accounts, and income tax records. Similar scenarios are commonplace in large enterprises having multiple customer, supplier, or partner databases. Each data source maintains different aspects of an entity, and resolving entities based on these attributes is a well-studied problem. However, in many cases documents in one source reference those in others; e.g., a person may provide his driving-licence number while applying for a passport, or vice versa. These links define relationships between documents of the same entity (as opposed to inter-entity relationships, which are also often used for resolution). In this paper we describe an algorithm to cluster documents that are highly likely to belong to the same entity by exploiting inter-document references in addition to attribute similarity. Our technique uses a combination of iterative graph traversal, locality-sensitive hashing, iterative match-merge, and graph clustering to discover unique entities based on a document corpus. A unique feature of our technique is that new sets of documents can be added incrementally while having to re-resolve only a small subset of a previously resolved entity-document collection. We present performance and quality results on two data sets: a real-world database of companies and a large synthetically generated 'population' database. We also demonstrate the benefit of using inter-document references for clustering in the form of enhanced recall of documents for resolution.
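
The core idea of the abstract, clustering documents by combining inter-document references with attribute similarity and accepting new documents incrementally, can be illustrated with a small sketch. This is not the authors' algorithm: the class name `IncrementalResolver`, the Jaccard threshold, and the document ids are all hypothetical, and a brute-force pairwise comparison stands in for the locality-sensitive hashing, match-merge, and graph-clustering stages the paper describes. It only shows how a reference link can join two documents whose attributes alone would not match.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


class IncrementalResolver:
    """Union-find over document ids; each cluster stands for one entity.

    Hypothetical illustration only, not the method from the paper.
    """

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # assumed similarity cut-off
        self.parent = {}            # doc id -> parent doc id
        self.tokens = {}            # doc id -> set of attribute tokens
        self.pending = {}           # unseen doc id -> ids of docs that cite it

    def _find(self, doc_id):
        while self.parent[doc_id] != doc_id:
            # Path halving keeps the trees shallow.
            self.parent[doc_id] = self.parent[self.parent[doc_id]]
            doc_id = self.parent[doc_id]
        return doc_id

    def _union(self, a, b):
        root_a, root_b = self._find(a), self._find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def add(self, doc_id, attributes, references=()):
        """Add one document without re-resolving existing clusters."""
        self.parent[doc_id] = doc_id
        self.tokens[doc_id] = set(attributes)

        # Attribute similarity: linear scan (the paper uses LSH instead).
        for other, other_tokens in self.tokens.items():
            if other == doc_id:
                continue
            if jaccard(self.tokens[doc_id], other_tokens) >= self.threshold:
                self._union(doc_id, other)

        # Outgoing references: merge now, or remember them for later.
        for ref in references:
            if ref in self.parent:
                self._union(doc_id, ref)
            else:
                self.pending.setdefault(ref, []).append(doc_id)

        # Incoming references recorded before this document arrived.
        for citing in self.pending.pop(doc_id, []):
            self._union(doc_id, citing)

    def entities(self):
        """Return the clusters as a sorted list of sorted id lists."""
        clusters = {}
        for doc_id in self.parent:
            clusters.setdefault(self._find(doc_id), set()).add(doc_id)
        return sorted(sorted(cluster) for cluster in clusters.values())


resolver = IncrementalResolver(threshold=0.5)
resolver.add("passport:1", {"john", "smith", "1980-01-01"},
             references=["licence:7"])
resolver.add("bank:3", {"jane", "doe", "1975-05-05"})
# The licence arrives later; its attributes overlap only weakly with the
# passport, but the passport's reference to it links the two documents.
resolver.add("licence:7", {"j", "smith", "london"})
print(resolver.entities())
# [['bank:3'], ['licence:7', 'passport:1']]
```

In this toy run the passport and licence share only one of five distinct tokens (Jaccard 0.2), so attribute similarity alone would keep them apart; the reference is what merges them, which is the recall benefit the abstract points to.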
